[FLINK-39902][tests] Fix race in RescaleTimelineITCase.testRescaleTerminatedByJobFinished #28378

Merged
RocMarshal merged 1 commit into apache:master from MartijnVisser:flink-39902-rescale-jobfinished
Jun 13, 2026

@MartijnVisser

Contributor

What is the purpose of the change

RescaleTimelineITCase.testRescaleTerminatedByJobFinished is flaky on slow CI. The test requests an upscale to a parallelism that exceeds the available slots, so the rescale never changes the running parallelism and is observable only as a recorded history entry. The test then unblocks the no-op task immediately, which races with the scheduler recording that second rescale. On a slow machine the job finishes before the rescale is recorded, the history stays at size 1, and the wait for size 2 times out.

Brief change log

  • Wait until the second rescale has been recorded (history size 2) before unblocking the task, so the in-progress rescale resolves to JOB_FINISHED once the job finishes.
  • Move the assumeThat(enabledRescaleHistory(...)) skip ahead of the requirement update so the disabled-history variant skips cleanly.
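
The ordering in the first bullet can be illustrated with a minimal, self-contained sketch. It is not the test code from this PR: `RescaleOrderingSketch`, `waitForHistorySize`, the `jobFinished` flag, and the 100 ms delay are all hypothetical stand-ins, and plain `java.util.concurrent` primitives play the role of the scheduler and the blocked task. The assumption modelled here is that a rescale is recorded only while the job is still running.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for the test's ordering; none of these names are Flink APIs. */
final class RescaleOrderingSketch {

    /** Polls until the history has at least {@code expected} entries or the timeout elapses. */
    static boolean waitForHistorySize(List<String> history, int expected, long timeoutMs)
            throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (history.size() < expected) {
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            Thread.sleep(5);
        }
        return true;
    }

    /** Runs the simulated scenario and returns the final history size. */
    static int run() {
        List<String> history = new CopyOnWriteArrayList<>();
        AtomicBoolean jobFinished = new AtomicBoolean(false);
        CountDownLatch unblockTask = new CountDownLatch(1);
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            history.add("rescale-1"); // first entry, present before the upscale request

            // Blocked "no-op task": the simulated job finishes once the latch is released.
            Future<?> job = pool.submit(() -> {
                unblockTask.await();
                jobFinished.set(true);
                return null;
            });

            // Simulated slow scheduler: records the second rescale late,
            // and only if the job has not finished in the meantime.
            pool.submit(() -> {
                Thread.sleep(100);
                if (!jobFinished.get()) {
                    history.add("rescale-2");
                }
                return null;
            });

            // The ordering described above: wait for the second entry, then unblock.
            if (!waitForHistorySize(history, 2, 5_000)) {
                throw new IllegalStateException("second rescale was never recorded");
            }
            unblockTask.countDown();
            job.get(5, TimeUnit.SECONDS);
            return history.size();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("history size = " + run());
    }
}
```

Calling `unblockTask.countDown()` before the wait would reproduce the race in this sketch: the simulated job would finish first, the late scheduler thread would skip its recording, and the history would stay at one entry.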

Verifying this change

Existing assertions are unchanged; the fix only enforces the ordering the sibling tests already get. Verified by running testRescaleTerminatedByJobFinished repeatedly in a loop locally without failure.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

Was generative AI tooling used to co-author this PR?
  • Yes (Claude Opus 4.8 (1M context))

Generated-by: Claude Opus 4.8 (1M context)

…minatedByJobFinished

Unblocking the task raced with the scheduler recording the second rescale; on
a slow machine the job finished first, leaving the history at size 1 and
timing out the wait. Wait for the rescale to be recorded before unblocking.

The assumeThat(enabledRescaleHistory) check had to move before the update RPC
because the new size-2 history wait is only meaningful when rescale history is
enabled; for the disabled parameter the history never grows and the wait would
hang.
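
Why the skip has to come before the wait can be shown with a small sketch. This is an illustration under stated assumptions, not the PR's code: `SkipBeforeWaitSketch`, `SkippedException`, `assumeTrue`, and `waitUntil` are hypothetical helpers written here, with a plain exception standing in for a test framework's assumption failure and a short timeout standing in for the hang.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.BooleanSupplier;

/** Hypothetical illustration of guard placement; not Flink or JUnit code. */
final class SkipBeforeWaitSketch {

    /** Stand-in for a test framework's "assumption failed, skip this test" signal. */
    static final class SkippedException extends RuntimeException {
        SkippedException(String message) {
            super(message);
        }
    }

    static void assumeTrue(boolean condition, String message) {
        if (!condition) {
            throw new SkippedException(message);
        }
    }

    /** Polls the condition until it holds or the timeout elapses. */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(5);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    /** Returns "SKIPPED", "PASSED", or "TIMED_OUT" for the given variant. */
    static String runVariant(boolean historyEnabled, boolean guardFirst) {
        List<String> history = new CopyOnWriteArrayList<>();
        try {
            if (guardFirst) {
                // New position: skip before anything depends on the history.
                assumeTrue(historyEnabled, "rescale history disabled");
            }
            if (historyEnabled) {
                // Only the enabled variant ever records entries.
                history.add("rescale-1");
                history.add("rescale-2");
            }
            if (!waitUntil(() -> history.size() >= 2, 200)) {
                return "TIMED_OUT"; // the disabled variant can never satisfy the wait
            }
            // Old position: too late to help the disabled variant.
            assumeTrue(historyEnabled, "rescale history disabled");
            return "PASSED";
        } catch (SkippedException e) {
            return "SKIPPED";
        }
    }

    public static void main(String[] args) {
        System.out.println(runVariant(false, true));
        System.out.println(runVariant(false, false));
    }
}
```

With the guard first, the disabled variant is skipped immediately; with the guard after the wait, the same variant only ever reaches the timeout.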

Generated-by: Claude Opus 4.8 (1M context)
@flinkbot

flinkbot commented Jun 10, 2026

Collaborator

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build

@RocMarshal RocMarshal left a comment

Contributor

Thanks @MartijnVisser
LGTM +1.

@RocMarshal RocMarshal merged commit ef4d88e into apache:master Jun 13, 2026
lihaosky pushed a commit that referenced this pull request Jun 13, 2026
…minatedByJobFinished (#28378) (#28413)

(cherry picked from commit ef4d88e)

Co-authored-by: Martijn Visser <2989614+MartijnVisser@users.noreply.github.com>